Fix performance regression for prefix caching #4270
Conversation
Pull request overview
This PR fixes a performance regression in prefix caching introduced by #4223 that caused increased cache misses when sending long and repeated requests in bursts. The fix enables caching and matching of prefix blocks from partially prefilled sequences.
Changes:
- Modified CachePrompt and PrefixMatch functions to allow caching and matching of blocks from partially prefilled sequences (where cache_len < prompt.size()); see the sketch after this list
- Changed Materialize return value attribute from [[nodiscard]] to [[maybe_unused]] since the return value is not always used
- Replaced FT_CHECK macros with TM_CHECK variants for consistency
- Refactored code formatting and variable initialization in several files
- Added async_ parameter to benchmark script
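The core behavioral change can be illustrated with a minimal standalone sketch. Everything below (the Seq struct, kBlockSeqLen, CacheablePrefixTokens) is hypothetical and simplified, not the TurboMind API; it only shows the idea that a partially prefilled prompt can still contribute its fully filled blocks to the prefix cache.

```cpp
// Minimal sketch (not the actual TurboMind code) of the idea behind the fix:
// even when a sequence is only partially prefilled (cache_len < prompt.size()),
// the blocks that are already fully filled can be cached and matched as a prefix.
#include <algorithm>
#include <iostream>
#include <vector>

constexpr int kBlockSeqLen = 4;  // tokens per KV block (small value for illustration)

struct Seq {
    std::vector<int> prompt;  // full prompt token ids
    int cache_len = 0;        // number of prompt tokens already prefilled into KV blocks
};

// Number of prompt tokens living in fully filled blocks, i.e. safe to publish to a
// prefix cache even though the sequence is not fully prefilled yet.
int CacheablePrefixTokens(const Seq& seq) {
    const int valid = std::min<int>(seq.cache_len, static_cast<int>(seq.prompt.size()));
    return valid / kBlockSeqLen * kBlockSeqLen;  // round down to a whole-block boundary
}

int main() {
    Seq seq;
    seq.prompt.assign(10, 7);  // 10-token prompt
    seq.cache_len = 6;         // only 6 tokens prefilled so far (partially prefilled)

    // A guard like `cache_len >= prompt.size()` would skip caching here entirely;
    // relaxing it allows publishing the first full block (4 tokens) right away.
    std::cout << "cacheable prefix tokens: " << CacheablePrefixTokens(seq) << "\n";  // 4
}
```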
Reviewed changes
Copilot reviewed 7 out of 7 changed files in this pull request and generated 3 comments.
Summary per file
| File | Description |
|---|---|
| src/turbomind/models/llama/SequenceManager.h | Changed Materialize attribute from [[nodiscard]] to [[maybe_unused]] |
| src/turbomind/models/llama/SequenceManager.cc | Enabled caching/matching for partially prefilled sequences; replaced FT_CHECK with TM_CHECK; cleaned up formatting |
| src/turbomind/models/llama/BlockTrie.cc | Refactored Match and Cache functions with clearer variable names and improved control flow |
| src/turbomind/models/llama/BlockManager.h | Removed unused function declaration |
| src/turbomind/models/llama/BlockManager.cc | Replaced FT_CHECK with TM_CHECK; removed unused function; added timestamp initialization |
| src/turbomind/engine/engine.cc | Removed unused variables; removed unnecessary lambda captures |
| benchmark/profile_pipeline_api.py | Added async_ parameter support |
```cpp
BlockIds block_ids;
UniqueIds block_unique_ids;
std::tie(block_ids, block_unique_ids) = block_trie_->Cache(seq, seq.prompt);
if (auto& seq = *sequences[i]; !seq.prompt.empty()) {
```
Copilot AI (Jan 15, 2026)
The condition !seq.prompt.empty() should also check if there are blocks available to cache. When seq.cache_len < seq.prompt.size() but seq.blocks is empty or insufficient, attempting to cache could lead to unexpected behavior. Consider adding a check for seq.blocks.size() to ensure blocks exist before caching.
```diff
- if (auto& seq = *sequences[i]; !seq.prompt.empty()) {
+ if (auto& seq = *sequences[i]; !seq.prompt.empty() && !seq.blocks.empty()) {
```
```cpp
}

/// TODO: `Unlock` and `Lock` can't be batched because there may be repeated blocks between sequences
if (const int offset = seq.cache_len / block_seq_len_; offset < block_ids.size()) {
```
Copilot AI (Jan 15, 2026)
The offset calculation using integer division could lead to data loss if seq.cache_len is not perfectly aligned with block_seq_len_. Consider adding a check or comment to clarify the expected behavior when seq.cache_len % block_seq_len_ != 0.
```diff
- if (const int offset = seq.cache_len / block_seq_len_; offset < block_ids.size()) {
+ // NOTE: seq.cache_len is expected to be aligned to block_seq_len_. The integer division below
+ // intentionally converts from cached tokens to a whole-block offset; any partial block (non-zero
+ // remainder) is ignored here and will be effectively truncated.
+ const int offset = seq.cache_len / block_seq_len_;
+ if (seq.cache_len % block_seq_len_ != 0 && rank_ == 0) {
+     TM_LOG_DEBUG(
+         "[SeqMgr][match] ID %llu, cache_len %d not aligned to block_seq_len_ %d; truncating to %d blocks",
+         seq.id, seq.cache_len, block_seq_len_, offset);
+ }
+ if (offset < static_cast<int>(block_ids.size())) {
```
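For a concrete feel of the arithmetic discussed in this comment, here is a tiny standalone example; the values and variable names are illustrative only, not taken from the repository.

```cpp
// Standalone illustration of the whole-block offset computation discussed above.
// With block_seq_len_ = 64 and cache_len = 130, the division yields offset = 2:
// two full blocks are skipped when matching, and the 2 tokens sitting in the
// partially filled third block are ignored (truncated) by design.
#include <cassert>

int main() {
    const int block_seq_len_ = 64;
    const int cache_len      = 130;
    const int offset         = cache_len / block_seq_len_;  // integer division -> 2
    assert(offset == 2);
    assert(cache_len % block_seq_len_ == 2);  // the non-aligned remainder that gets dropped
}
```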
```cpp
b.use_count = 1;
b.unique_id = unique_id_++;
FT_CHECK(is_active(b)); // post-condition
b.timestamp = timestamp_++;
```
Copilot AI (Jan 15, 2026)
The timestamp is now being set during Allocate, but this appears to be a new addition. The timestamp should logically be set when blocks are allocated, but verify this doesn't cause issues with LRU eviction logic, especially if blocks are allocated but not immediately used.
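To make the concern concrete, here is a hedged standalone sketch of timestamp-based LRU victim selection. It is not the TurboMind BlockManager; it only assumes eviction prefers the unused block with the smallest timestamp, which is why leaving a freshly allocated block's timestamp at its default could make it look like the oldest candidate.

```cpp
// Hedged sketch: why stamping `timestamp` at allocation can matter for LRU-style eviction.
// Assumption (not taken from the repository): eviction picks the unused block with the
// smallest (oldest) timestamp.
#include <cstdint>
#include <iostream>
#include <vector>

struct Block {
    int      id        = -1;
    int      use_count = 0;  // > 0 means actively held by a sequence
    uint64_t timestamp = 0;  // last-touched counter; 0 looks "infinitely old" to LRU
};

// Pick the evictable block that was touched least recently (smallest timestamp).
const Block* PickVictim(const std::vector<Block>& blocks) {
    const Block* victim = nullptr;
    for (const auto& b : blocks) {
        if (b.use_count == 0 && (victim == nullptr || b.timestamp < victim->timestamp)) {
            victim = &b;
        }
    }
    return victim;
}

int main() {
    uint64_t clock = 1;
    std::vector<Block> blocks(3);
    blocks[0] = {0, 0, clock++};  // old cached block
    blocks[1] = {1, 0, clock++};  // newer cached block
    blocks[2] = {2, 0, 0};        // just allocated then released, timestamp never set

    // Without stamping at allocation, the brand-new block 2 is evicted first;
    // stamping it (as in `b.timestamp = timestamp_++` above) would instead
    // evict the genuinely oldest block 0.
    std::cout << "victim: " << PickVictim(blocks)->id << "\n";  // prints 2
}
```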
In #4223, the change in scheduling policy caused more cache misses when sending long and repeated requests in a burst. This PR fixes it by enabling the caching and matching of prefix blocks from partially prefilled sequences.